feat: deterministic replay for recorded RLM runs #21

Merged — errantsky merged 2 commits into main from feat/deterministic-replay, Mar 6, 2026
Conversation

@errantsky
Owner

Summary

Adds the ability to record LLM responses during a run and replay them later without making live API calls. This enables:

  • Debugging — replay a failed run, patch one iteration's code to test a fix
  • Regression testing — replay a successful run against a new codebase version
  • Model comparison — replay with a different model via the :live fallback
  • Cost optimization — re-execute eval steps without any LLM calls

How it works

Recording

When enable_replay_recording: true is set, the Worker emits a [:rlm, :llm, :response, :recorded] telemetry event after each successful LLM call. The EventLogHandler persists the full response text and usage metadata as :llm_response events in both the in-memory Agent and :dets TraceStore.

The original context and query are also stored in the :node_start event for depth-0 workers, so replay can recover the inputs.
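The recording path can be sketched as follows. The config flag and the telemetry event name come from this PR; the handler body, `metadata` fields, and `TraceStore.append/2` call are assumptions shown only to illustrate the flow:

```elixir
# Opt in to recording (field name from this PR; default is false).
config = [enable_replay_recording: true]

# The EventLogHandler attaches to the event roughly like this.
# The metadata keys and TraceStore.append/2 are illustrative assumptions.
:telemetry.attach(
  "replay-recorder",
  [:rlm, :llm, :response, :recorded],
  fn _event, _measurements, metadata, _handler_config ->
    # Persist the full response text and usage as an :llm_response event
    # in both the in-memory Agent and the :dets TraceStore.
    TraceStore.append(metadata.run_id, {:llm_response, metadata.response, metadata.usage})
  end,
  nil
)
```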

Replay

RLM.replay(run_id) builds a Tape (ordered list of recorded responses) from the EventLog, then starts a new Worker that uses RLM.Replay.LLM — a process-dict-based LLM behaviour implementation that returns responses from the tape instead of calling the API. All eval'd code is re-executed normally.
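A minimal usage sketch of the flow above. `RLM.replay/2` and `Tape.from_events/1` are named in this PR; `EventLog.events/1`, the tape internals, and the `{:ok, result}` return shape are assumptions:

```elixir
run_id = "some-recorded-run"  # hypothetical id

# Internally, replay builds a Tape from the recorded events...
tape = RLM.Replay.Tape.from_events(EventLog.events(run_id))

# ...then starts a Worker whose LLM module is RLM.Replay.LLM,
# so each "LLM call" pops the next recorded response off the tape
# while all eval'd code runs for real.
{:ok, result} = RLM.replay(run_id)
```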

Patching

RLM.replay(run_id, patch: %{0 => "new_code"}) replaces the code at iteration 0 before eval. The tape entry is still consumed to maintain iteration alignment.
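In sketch form (the `patch:` option is from this PR; the Worker-internal variable names and the example code string are assumptions):

```elixir
run_id = "some-recorded-run"  # hypothetical id

# Substitute iteration 0's code; its tape entry is still consumed,
# so iterations 1..n stay aligned with their recorded responses.
{:ok, result} = RLM.replay(run_id, patch: %{0 => "IO.inspect(state)"})

# Inside the Worker, patch application before eval likely reduces to
# a lookup with the recorded code as the default (names assumed):
code_to_eval = Map.get(state.replay_patches, iteration, recorded_code)
```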

Fallback

RLM.replay(run_id, fallback: :live, config: [llm_module: RLM.LLM]) uses RLM.Replay.FallbackLLM, which consumes tape entries first and switches to live LLM calls when exhausted. This handles the case where a patch causes extra iterations beyond the recorded tape length.
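The tape-then-live strategy might look roughly like this. The module and option names come from this PR, but the `chat/4` argument order, `pop_tape_entry/0`, and return shapes are assumptions:

```elixir
defmodule RLM.Replay.FallbackLLM do
  # Sketch: serve recorded responses while the tape lasts,
  # then delegate to a configurable live LLM module.
  def chat(messages, tools, opts, state) do
    case pop_tape_entry() do
      {:ok, response} ->
        {:ok, response}

      :empty ->
        # llm_module comes from the :config option, e.g.
        # RLM.replay(run_id, fallback: :live, config: [llm_module: RLM.LLM])
        live = Keyword.get(opts, :llm_module, RLM.LLM)
        live.chat(messages, tools, opts, state)
    end
  end
end
```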

New modules

| Module | Purpose |
| --- | --- |
| `RLM.Replay` | Orchestrator: `replay/2` with patch/fallback/config support |
| `RLM.Replay.Tape` | Struct + `from_events/1` builder from EventLog/TraceStore |
| `RLM.Replay.LLM` | LLM behaviour — returns tape responses via process dict |
| `RLM.Replay.FallbackLLM` | LLM behaviour — tape first, then live fallback |

Modified modules

| Module | Change |
| --- | --- |
| `RLM.Config` | Added `enable_replay_recording` field (default: `false`) |
| `RLM.Worker` | `replay_patches` struct field, tape loading in init, patch application before eval, LLM response recording telemetry |
| `RLM.Telemetry` | Registered `[:rlm, :llm, :response, :recorded]` event |
| `RLM.Telemetry.EventLogHandler` | Handler for `:llm_response` events + `original_context`/`original_query` in `:node_start` |
| `RLM` | `replay/2` public API, updated boundary exports |

Design decisions

  • Process dict for tape state — The RLM.LLM behaviour's chat/4 doesn't have a replay-state argument. Rather than changing the behaviour (breaking all implementations), the tape lives in the Worker's process dict, matching RLM.Eval's existing pattern.
  • Code-level patches, not response-level — Patches replace the code that gets eval'd, not the LLM response. The tape entry is still consumed to maintain iteration alignment. This is the most useful granularity for debugging.
  • Recording is opt-in — Full LLM responses can be large, so enable_replay_recording defaults to false.
  • Subcall replay deferred — This replays root worker iterations only. Subcall replay (child workers with their own tapes) is a natural extension but adds significant complexity.
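The process-dict approach described above can be sketched like this. The dict key and entry shapes are assumptions; only the general pattern (mirroring `RLM.Eval`) is from the PR:

```elixir
# Seed the tape into the Worker's process dictionary before the run.
# :rlm_replay_tape is a hypothetical key chosen for illustration.
Process.put(:rlm_replay_tape, tape.responses)

# Each replayed LLM call then pops the next entry without any
# change to the chat/4 behaviour signature:
defp pop_tape_entry do
  case Process.get(:rlm_replay_tape, []) do
    [] ->
      :empty

    [next | rest] ->
      Process.put(:rlm_replay_tape, rest)
      {:ok, next}
  end
end
```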

Test plan

  • Recording: llm_response events stored when flag enabled, skipped when disabled
  • Recording: multi-iteration responses captured in order
  • Recording: original_context/original_query stored in node_start events
  • Tape: builds from EventLog, falls back to TraceStore
  • Tape: error cases (no events, no responses)
  • Replay LLM: returns entries in order, errors when exhausted
  • Replay: produces same result as original run
  • Replay: multi-iteration runs
  • Replay: patch substitutes code at specific iterations
  • Replay: error for nonexistent runs
  • Fallback: :live falls back to real LLM when tape exhausted
  • Fallback: tape entries consumed before fallback kicks in
  • Fallback: :error (default) returns error when exhausted
  • Public API: RLM.replay/2 delegates correctly
  • End-to-end smoke test (record → replay → patch → fallback)
  • Full suite: 162 tests pass, 0 failures
  • mix compile --warnings-as-errors clean
  • mix format --check-formatted clean
  • mix docs — no new warnings

🤖 Generated with Claude Code

errantsky and others added 2 commits March 5, 2026 17:13
Enable replaying previously recorded runs without making live LLM calls.
Recorded LLM responses are stored as trace events and consumed in order
during replay, re-executing all eval'd code deterministically.

New modules:
- RLM.Replay — orchestrator with patch support for code substitution
- RLM.Replay.Tape — builds ordered response sequences from EventLog
- RLM.Replay.LLM — LLM behaviour impl using process-dict tape state

Recording infrastructure:
- enable_replay_recording config flag (default: false)
- [:rlm, :llm, :response, :recorded] telemetry event
- original_context/query stored in node_start events (depth-0)
- replay_patches field on Worker struct for code patching

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a replay patch causes extra iterations beyond what the tape
recorded, the :live fallback switches to a real LLM module instead
of returning an error. The fallback module is configurable via the
:config option's llm_module key.

New module: RLM.Replay.FallbackLLM — tries tape first, delegates
to a live LLM module when entries are exhausted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@errantsky errantsky merged commit c111015 into main Mar 6, 2026
1 check passed
@errantsky errantsky deleted the feat/deterministic-replay branch March 6, 2026 04:59
